Abstract: The effective use of humanoid robots in space will depend upon 
the efficacy of interaction between humans and robots. The key to 
achieving this interaction is to provide the robot with sufficient skills for 
natural communication with humans so that humans can interact with the 
robot almost as though it were another human. This requires that a number 
of basic capabilities be incorporated into the robot, including voice 
recognition, natural language, and cognitive tools on-board the robot to 
facilitate interaction between humans and robots through use of common 
representations and shared humanlike behaviors. 
1. INTRODUCTION 

Humanoid robots are now being built to assist 
humans in a variety of activities. The Robonaut 
platform (Ambrose, et al., 2000) is a humanoid robot 
designed to assist human astronauts working in space 
(Figure 1). Demonstrated teleoperative capabilities 
of Robonaut include dexterous grasping, tool 
handling, and cooperation with human teammates on 
manual tasks that require both information exchange 
and physical interaction between robot and human. 
Current development efforts are focused upon the 
implementation of autonomous behaviors and 
capabilities for Robonaut. 

Effective collaboration between robots and humans 
requires the use of an efficient interface whereby a 
human can communicate and interact with a robot 
almost as easily as with another human. In this 
interaction the human may act as a supervisor and/or 
collaborator with the robot. Human-robot 
collaboration is facilitated by a number of 
capabilities built into the interface and the robot 
itself, including voice recognition, natural language 
and gesture understanding, and behaviors supporting 
dynamic autonomy (Sofge, et al., 2003). The 
inclusion of cognitively plausible representations and 
processes incorporated into the robot provides a 
further basis for facilitating collaboration between 
humans and robots, thereby reducing the human 
effort required to adapt to limitations of the robot as 
a non-human collaborator. Use of a cognitive model 
aboard the robot facilitates better communication and 
interaction between the human and the robot through 
use of a common representational framework for the 
environment and objects within it, processing of 
sensor information, and joint problem solving 
involving both humans and robots. 

In this paper we focus on the use of cognitive models 
of spatial reasoning capabilities aboard Robonaut to 
enhance communications between human astronauts 
and Robonaut, thereby enabling collaborative teams 
consisting of human astronauts and humanoid robots. 


2. COGNITIVE HUMANOID ROBOTS 

Achieving effective collaboration between humans 
and robots will require the use of cognitive models 
on-board the robots. Embodied cognition, as we call 
it, uses cognitive models of human performance to 
augment a robot’s reasoning capabilities, and 
facilitates human-robot interaction in two ways. 
First, we hypothesize that the more a robot behaves 
like a human being, the easier it will be for humans 
to predict and understand its behavior and interact 
with it. Second, if humans and robots share at least 
some of the representational structures in their 
interactive communications or activities, we further 
hypothesize that communication among them will be 
much easier. For example, in tasks requiring 
direction generation, humans naturally use 
qualitative spatial relationships (Miller and Johnson-Laird, 
1976; Tversky, 1993) such as “left,” “up,” 
“east,” or “north.” Interaction with a robot capable 
of manipulating the same representation instead of 
traditional real number matrices would be more 
natural and efficient. In (Bugajska, et al., 2002) and 
(Trafton, et al., 2003) we used cognitive models of 
human performance of the task to augment the 
capabilities of robotic systems. 

We are investigating the use of two cognitive 
architectures based on human cognition for certain 
high-level control mechanisms aboard Robonaut. 
These cognitive architectures are ACT-R/S (Harrison 
and Schunn, 2003) based upon the ACT-R 
architecture (Anderson and Lebiere, 1998) and 
Polyscheme (Cassimatis, 2002). 

ACT-R is one of the most prominent cognitive 
architectures to have emerged in the past two 
decades as a result of the information processing 
revolution in the cognitive sciences. Also called a 
unified theory of cognition, ACT-R is a relatively 
complete theory about the structure of human 
cognition that strives to account for the full range of 
cognitive behavior with a single, coherent set of 
mechanisms. Its chief computational claims are: 
first, that cognition functions at two levels, one 
symbolic and the other sub-symbolic; second, that 
symbolic memory has two components, one 
procedural and the other declarative; and third, that 
the sub-symbolic performance of memory is 
optimized in response to the statistical structure of 
the environment. These theoretical claims are 
implemented as a production-system modeling 
environment. The theory has been successfully used 
to account for human performance data in a wide 
variety of domains including memory for goals 
(Altmann and Trafton, 2002), human computer 
interaction (Anderson, et al., 1997), and scientific 
discovery (Schunn and Anderson, 1998). 

The ACT-R/S system uses three different spatial 
representations suggested in (Harrison and Schunn, 
2003). Briefly, they suggest that people use a focal 
representation for object identification that consists 
of non-metric geons, a manipulative representation 
for grasping and tracking that consists of metric 
geons, and a configural representation for navigation 
that consists of rectangular bounding regions. While 
a coarse representation is adequate for obstacle 
avoidance in navigation, in order to do perspective 
taking, we must spatially transform an object, 
focusing on the configural representation to 
determine that object’s spatial references. These 
transformations must then be mapped onto the user’s 
representations so that actions can be performed. We 
use ACT-R/S to create cognitively plausible models 
of human performance of tasks to be performed by 
the robots. 

Furthermore, we are using Cassimatis’ Polyscheme 
architecture (Cassimatis, 2002) for spatial, temporal 
and physical reasoning. The Polyscheme cognitive 
architecture enables multiple representations and 
algorithms (including ACT-R models), encapsulated 
in “specialists,” to be integrated into inferences that 
agents make about a situation or about possible 
situations. Finally, we use an updated version of the 
Polyscheme implementation of a physical reasoner to 
help keep track of the robot’s physical environment. 

2.1 Perspective-Taking 

One of the features of human cognition that 
facilitates natural human-robot interaction is 
“perspective-taking”. In order to understand 
utterances such as “the wrench on my left,” the robot 
must be able to reason from the perspective of the 
speaker to resolve the meaning of “my left”. We will 
use the Polyscheme architecture and ACT-R models 
to endow the robot with the ability to conceive of 
task-oriented goals and knowledge of another person. 
This will allow the robot to more easily predict and 
explain its own behavior, as well as the behavior of 
others, making it a better partner in a collaborative 
activity. 
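
The geometric core of resolving “my left” can be sketched in a few lines. The following is an illustration only, not the Polyscheme or ACT-R/S model: it assumes a flat 2-D world, poses given as position plus heading, and hypothetical names (Pose, to_agent_frame, objects_on_left) and coordinates that do not come from the systems described here.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    x: float
    y: float
    heading: float  # radians, counterclockwise from the world x-axis


def to_agent_frame(agent: Pose, point):
    """Express a world-frame (x, y) point as (forward, left) offsets from agent."""
    dx, dy = point[0] - agent.x, point[1] - agent.y
    c, s = math.cos(agent.heading), math.sin(agent.heading)
    return c * dx + s * dy, -s * dx + c * dy


def objects_on_left(agent: Pose, objects):
    """Names of objects lying to the left of the agent's facing direction."""
    return sorted(
        name for name, xy in objects.items() if to_agent_frame(agent, xy)[1] > 0
    )


# Hypothetical scene: robot and speaker face each other across two wrenches.
robot = Pose(0.0, 0.0, 0.0)
speaker = Pose(2.0, 0.0, math.pi)
wrenches = {"wrench_a": (1.0, 1.0), "wrench_b": (1.0, -1.0)}

print(objects_on_left(robot, wrenches))    # the robot's own left
print(objects_on_left(speaker, wrenches))  # "my left" as the speaker means it
```

In this scene the two calls select different wrenches, which is why the robot has to evaluate the phrase in the speaker's frame rather than its own.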

Polyscheme has a simulation mechanism, called a 
“world,” in which the robot is endowed with 
perspective-taking capabilities. Polyscheme allows 
the robot to reason about what it sees in its 
immediate environment from different perspectives. 
Using worlds, Polyscheme can simulate the 
perspective it would have at other times, different 
places and in other hypothetical worlds, and use its 
specialists to make inferences within those 
perspectives. Polyscheme uses reasoning algorithms 
such as counterfactual reasoning, backtracking 
search, truth-maintenance and stochastic simulation. 
We have created a specialist to reason about the 
perspective(s) of other people. This allows 
Polyscheme to predict and explain other people’s 
behavior, using its perceptual, motor, procedural, 
memory, spatial and physical specialists from the 
perspective of another person. 

For both ACT-R/S and Polyscheme we have created 
preliminary models that can perform simple spatial 
perspective-taking tasks. There seem, however, to 
be advantages and disadvantages to both systems. 
For example, ACT-R/S has more difficulty doing 
large scale simulations, but has a large amount of 
historical cognitive plausibility (e.g., there have been 
a large number of empirical and psychological 
studies validating ACT-R/S), while Polyscheme has 
comparatively less of a cognitive history. 
Additionally, because the representations and 
operations of each system are a bit different, their 
behaviors are different and various tasks may be 
easier or more straightforward to model for one 
system than for another. 


3. NATURAL LANGUAGE UNDERSTANDING 

While ACT-R/S and Polyscheme enable us to 
embody a common cognitive model for tasks and 
spatial reasoning, we employ a natural language and 
gesture understanding system by which a human and 
robot can communicate this information to each 
other. Our rationale here is that the use of as natural 
an interface as possible again facilitates interaction. 
Our natural language interface combines a 
commercial speech recognition front-end with an in- 
house developed deep parsing system, NAUTILUS 
(Wauchope, 1994). ViaVoice™ is used to translate 
the speech signal into text, which is then passed to 
our natural language understanding system, NAUTILUS, 
to produce both syntactic and semantic 
interpretations. The semantic interpretation, 
interpreted gestures from the vision system, and 
command inputs from the computer or other 
interfaces are compared, matched and resolved in the 
command interpretation system. 

Using our multimodal interface (Figure 2) the human 
user can interact with the robot using both natural 
language and gestures. The semantic interpretation 
is linked, where necessary, to gesture information via 
the Gesture Interpreter, Goal Tracker/Spatial 
Relations component, and Appropriateness/Need 
Filter, and an appropriate robot action or response 
results. 



Fig. 2. NRL Multimodal Interface 


For example, the human user can ask the robot “How 
many objects do you see?” ViaVoice™ analyzes the 
speech signal, producing a text string. NAUTILUS 
parses the string and produces a representation 
something like the following, simplified here for 
expository purposes. 

(ASKWH                                            (1) 
  (MANY N3 (:CLASS OBJECT) PLURAL) 
  (PRESENT #:V7791 
    (:CLASS P-SEE) 
    (:AGENT (PRON N1 (:CLASS SYSTEM) YOU)) 
    (:THEME N3))) 

The parsed text string is mapped into a kind of 
semantic representation (1). The various verbs or 
predicates of an utterance (e.g. see) are mapped into 
corresponding semantic classes (p-see) that have 
particular argument structures (agent, theme); for 
example “you” is the agent of the p-see class of verbs 
in this domain and “objects” is the theme of this 
verbal class, represented as “N3”, a kind of co-indexed 
trace element in the theme slot of the 
predicate, since this element is syntactically fronted 
in English wh-questions. If the spoken utterance 
requires a gesture for disambiguation, as in, for 
example, the sentence “Look over there,” the gesture 
components obtain and send the appropriate gesture 
to the Goal Tracker component, which combines 
linguistic and gesture information. 
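
To make the structure of representation (1) concrete, the sketch below reads a cleaned-up transcription of it as a nested list and pulls out the predicate class, the agent, and the co-indexed theme. This is not NAUTILUS code: the helper names (tokenize, read, slot, resolve_index) are hypothetical, and the colon-prefixed slot names (:CLASS, :AGENT, :THEME) are our reading of the simplified example.

```python
def tokenize(text):
    """Split a parenthesized expression into tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def read(tokens):
    """Read one expression from the token list (tokens are consumed in place)."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(read(tokens))
        tokens.pop(0)  # drop the closing parenthesis
        return items
    return token


def slot(expr, key):
    """Return the first sub-list of expr whose head is key, or None."""
    for item in expr:
        if isinstance(item, list) and item and item[0] == key:
            return item
    return None


def resolve_index(tree, index):
    """Find the top-level constituent that introduces a co-index such as N3."""
    for item in tree[1:]:
        if isinstance(item, list) and len(item) > 1 and item[1] == index:
            return item
    return None


PARSE = """
(ASKWH
  (MANY N3 (:CLASS OBJECT) PLURAL)
  (PRESENT #:V7791
    (:CLASS P-SEE)
    (:AGENT (PRON N1 (:CLASS SYSTEM) YOU))
    (:THEME N3)))
"""

tree = read(tokenize(PARSE))
predicate = slot(tree, "PRESENT")
verb_class = slot(predicate, ":CLASS")[1]    # semantic class of the verb
agent = slot(predicate, ":AGENT")[1][-1]     # surface form of the agent
theme_index = slot(predicate, ":THEME")[1]   # trace left by the fronted phrase
theme_class = slot(resolve_index(tree, theme_index), ":CLASS")[1]

print(tree[0], verb_class, agent, theme_index, theme_class)
```

Following the N3 index from the theme slot back to the fronted phrase is the step that recovers “objects” as the thing being seen.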


4. SPATIAL LANGUAGE 

As human operators we often think in terms of the 
relative spatial positions of objects, and we use such 
relational linguistic terminology naturally in 
communicating with our human colleagues. For 
example, a speaker might say, “Hand me the wrench 
on the table.” If the assistant cannot find the wrench, 
the speaker might say, “The wrench is to the left of 
the toolbox.” The assistant need not be given precise 
coordinates for the wrench but can look in the area 
specified using the spatial relational terms. 

In a similar manner, this type of spatial language can 
be helpful for intuitive communication with a robot 
in many situations. Relative spatial terminology can 
be used to limit a search space by focusing attention 
in a specified region, as in “Look to the left of the 
toolbox and find the wrench.” It can be used to issue 
robot commands, such as “Pick up the wrench on the 
table.” A sequential combination of such directives 
can be used to describe and issue a high level task, 
such as, “Find the toolbox on the table behind you. 
The wrench is on the table to the left of the toolbox. 
Pick it up and bring it back to me.” Finally, spatial 
language can also be used by the robot to describe its 
environment, thereby providing a natural linguistic 
description of the environment, such as, “There is a 
wrench on the table to the left of the toolbox.” 

In all of these cases the spatial language increases the 
dynamic autonomy of the system by giving the 
human operator a less restrictive vernacular for 
communicating with the robot and vice versa. 
However, the examples above also assume some 
level of object recognition by the robot. 

To address the object recognition problem, we use 
natural language to address spatial relations, thereby 
assisting humans interacting with robots in 
recognition and labeling of objects (Skubic, et al., 
2002). Given our natural language interface, a human 
can easily communicate with the robot about objects 
and spatial relations in the environment through the 
use of a dialog that is easy and natural for the human. 
Furthermore, once an object is labeled, the user can 
then issue additional commands using natural spatial 
terms and by referencing the named object. An 
example is given in (2): 

Human: “How many objects do you see?”           (2) 

Robot: “I see 4 objects.” 

H: “Where are they located?” 

Robot: “There are two objects in front of me, 
one object on my right, and one object 
behind me.” 

H: “The nearest object in front of you is a 
toolbox. Place the wrench to the left 
of the toolbox.” 
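
A reply such as the robot's in (2) can be produced by binning detected objects into qualitative regions around the robot. The sketch below is one simple way to do this and is not the method used in our interface: it assumes 2-D object positions, four 90-degree sectors centered on the robot's heading, and hypothetical names (SECTORS, sector, describe) and coordinates.

```python
import math
from collections import Counter

# Sector labels in counterclockwise order, starting from straight ahead.
SECTORS = ["in front of me", "on my left", "behind me", "on my right"]


def sector(robot_xy, robot_heading, obj_xy):
    """Qualitative region of an object relative to the robot's heading."""
    dx, dy = obj_xy[0] - robot_xy[0], obj_xy[1] - robot_xy[1]
    bearing = math.atan2(dy, dx) - robot_heading
    # Shift by 45 degrees so the front sector spans [-45, +45) degrees.
    shifted = (bearing + math.pi / 4) % (2 * math.pi)
    index = int(shifted // (math.pi / 2)) % 4  # % 4 guards float round-off
    return SECTORS[index]


def describe(robot_xy, robot_heading, objects):
    """Summarize object counts per region as a short sentence."""
    counts = Counter(sector(robot_xy, robot_heading, xy) for xy in objects)
    total = sum(counts.values())
    noun = "object" if total == 1 else "objects"
    parts = [f"{counts[name]} {name}" for name in SECTORS if counts[name]]
    return f"I see {total} {noun}: " + ", ".join(parts) + "."


# Hypothetical detections, robot at the origin facing along the x-axis.
detections = [(2.0, 0.3), (3.0, -0.5), (0.0, -2.0), (-2.0, 0.0)]
print(describe((0.0, 0.0), 0.0, detections))
```

The sector width is a design choice; narrower front and rear sectors, or overlapping regions with graded membership, would change which term is chosen near the boundaries.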

Establishing a common frame is necessary so that it 
is clear what is meant by spatial references generated 
both by the human operator and by the robot. 
Thus, if the human commands the robot, “Turn left,” 
the robot must know whether the operator is referring 
to the robot’s left or the operator’s left. Likewise, in 
a human-robot dialog, if the robot places a second 
object “just to the left of the first object,” both need 
to know if the goal location is to the left of the robot 
or the human. 

Currently, commands using spatial references (e.g., 
“Go to the right of the table”) assume an extrinsic 
reference frame of the object (table) and are based on 
the robot’s viewing perspective to be consistent with 
Grabowski’s “outside perspective” (Grabowski, 
1999). That is, the spatial reference assumes the 
robot is facing the referent object. 
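
Under this convention, a goal location for a command like “Go to the right of the table” can be derived from the robot's line of sight to the referent. The following sketch shows the geometry only, under assumptions of our own: 2-D positions, a point-like referent, and a hypothetical function name (right_of) and offset parameter that are not part of the system described here.

```python
import math


def right_of(robot_xy, referent_xy, offset):
    """Point `offset` units to the right of the referent, as seen by the robot.

    The robot is assumed to face the referent, so the viewing direction is
    the unit vector from the robot to the referent.
    """
    vx, vy = referent_xy[0] - robot_xy[0], referent_xy[1] - robot_xy[1]
    dist = math.hypot(vx, vy)
    if dist == 0:
        raise ValueError("robot and referent coincide; no viewing direction")
    ux, uy = vx / dist, vy / dist
    # Rotating the viewing direction clockwise by 90 degrees gives "right".
    rx, ry = uy, -ux
    return (referent_xy[0] + offset * rx, referent_xy[1] + offset * ry)


# Hypothetical layout: the same table viewed from two opposite sides.
table = (0.0, 3.0)
print(right_of((0.0, 0.0), table, 1.0))
print(right_of((0.0, 6.0), table, 1.0))
```

The two calls return points on opposite sides of the table, which illustrates why the viewing perspective has to be fixed before such a command can be executed.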

Although there has been considerable research on the 
linguistics of spatial language for humans, there has 
been only limited work done in using spatial 
language for interacting with robots. Some 
researchers have proposed a framework for such an 
interface (Muller, et al., 2000). Moratz (2001) 
investigated the spatial references used by human 
users to control a mobile robot. An interesting 
finding is that the test subjects consistently used the 
robot’s perspective when issuing directives, in spite 
of the 180-degree rotation. At first, this may seem 
inconsistent with human to human communication. 
However, in human to human experiments, Tversky 
(1999) observed a similar result and found that 
speakers took the listener’s perspective in tasks 
where the listener had a significantly higher 
cognitive load than the speaker. 

The experiments by Moratz (2001) provide rationale 
for using the robot’s viewing perspective. We are 
currently investigating this further through use of 
human-factors experiments where individuals who 
do not know the spatial reasoning capabilities and 
limitations of the robot provide instructions to the 
robot for performing various tasks where spatial 
referencing is required (Perzanowski, et al., 2003). 
The results of this study will be used to enhance the 
multimodal interface by establishing a common 
language for spatial referencing which incorporates 
those constructs and utterances most frequently used 
by untrained operators for commanding the robot. 



5. CONCLUSIONS 

Humanoid robots are being designed and built to 
provide assistance to humans in complex and 
challenging work environments, such as outer space. 
Achieving effective use of these humanoid robots in 
space will depend upon the difficulty of the tasks 
required of human astronauts interacting with robots. 
The use of cognitive tools aboard the robots provides 
a number of benefits, such as shared representations, 
behaviors, and modes of interaction between humans 
and robots, thereby easing the cognitive load on the 
part of the human. The key to achieving effective 
interaction is to provide the robot with sufficient 
skills for natural communication with humans so that 
humans can interact with the robot almost as easily 
as with another human. 

This paper describes the design, implementation, and 
capabilities of a robotic system architecture for a 
robot which can be used (at some level) to 
collaborate with a human. The capabilities required 
of the robot include voice recognition, natural 
language understanding, gesture recognition, spatial 
reasoning, and cognitive modeling with perspective-taking. 
These represent a small subset of potential 
capabilities humans utilize with one another in 
collaborating to perform a task in a complex 
environment, and barely scratch the surface of 
capabilities we might want to build into an 
intelligent, collaborative robot. 

The capabilities described above have been 
successfully implemented and demonstrated on 
several mobile robotic platforms (Sofge, et al., 
2004), and we are now porting them to Robonaut. 
We are also extending the capabilities of the 
cognitive architectures (both ACT-R/S and 
Polyscheme) and their perspective-taking cognitive 
models. Future work will focus on enhancing the 
cognitive models through expanded rulesets and 
cognitively plausible behaviors and reasoning 
mechanisms, and adding learning capabilities to the 
models so that the robots may be able to acquire new 
knowledge and skills through interaction with 
humans and while performing tasks. Parts of this 
architecture have already been extended to several 
robots designed specifically for enhanced human 
interaction, such as MIT’s robot Leonardo (Breazeal, 
2003) (Figure 3). While Leonardo is not a humanoid, 
it is being developed with human-like characteristics 
and functionalities. We are also extending the 
architecture and methodology to include and study 
collaboration between teams of robots and humans. 

Fig. 3. MIT’s Leonardo Robot (photo courtesy 
Cynthia Breazeal, © MIT Media Lab, 2002) 